Adding necessary imports

"Plotly" library is used for visualization.

Method for reading the dataset of the i-th user

Read CSVs with memory optimization

If the CSV files are read without any optimization, the columns are loaded as float64, so each file occupies a lot of memory. To reduce memory usage, this procedure sets the dtype parameter explicitly: a chunk is read first, and the column types inferred from it are then optimized.
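The chunk-then-optimize idea can be sketched as follows. This is a minimal illustration, not the notebook's actual reader: the function name `read_csv_optimized`, the `exclude` parameter, and the demo data are assumptions, and only float64 columns are downcast to float32.

```python
import io
import pandas as pd

def read_csv_optimized(source, sample_rows=1000, exclude=()):
    """Read a CSV, storing float64 columns as float32 where allowed.

    `exclude` (hypothetical parameter) lists columns to leave untouched,
    e.g. epoch timestamps, which lose precision in float32.
    """
    # Read a small chunk so pandas can infer the column types.
    sample = pd.read_csv(source, nrows=sample_rows)
    dtypes = {
        col: "float32"
        for col in sample.columns
        if col not in exclude and sample[col].dtype == "float64"
    }
    # Rewind in-memory buffers; file paths are simply reopened by pandas.
    if hasattr(source, "seek"):
        source.seek(0)
    return pd.read_csv(source, dtype=dtypes)

# Demo on made-up data (not taken from the real dataset).
csv_text = "timestamp,keystroke_counter\n" + "\n".join(
    f"{1575000000 + i}.0,{i * 0.5}" for i in range(50)
)
df_demo = read_csv_optimized(
    io.StringIO(csv_text), sample_rows=10, exclude=("timestamp",)
)
print(df_demo.dtypes)
print(df_demo.memory_usage(deep=True).sum(), "bytes")
```

Halving the width of the float columns roughly halves their memory footprint, at the cost of reduced precision, which is why a timestamp-like column is excluded in the demo.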

Task 1: Read dataset

The datasets are read iteratively and saved in a list, df_users; a progress message is printed as each file is read. Note that the timestamp feature is converted to datetime and stored in the dataframe.

Task 2 & 3: Dataset overview

In the first figure, we see the volume of data (input) per user and observe that User 7 has the most data and User 2 the least.

In the following figure, we plot the keystroke trend per user over time. First, the date_time feature is grouped with daily frequency, then keystroke_counter is summed per day. Finally, the result is plotted in the figure. Note that the start and end dates of each user's input (User x, x: 0 to 11) are also given in the figure's legend.

Here the date_time feature is again grouped with daily frequency, but this time the number of records per day (the daily input volume) is counted instead.
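The two daily aggregations described above can be sketched with `pd.Grouper`. The column names date_time and keystroke_counter come from the text; the three sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample rows; the real data comes from df_users.
df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2019-12-02 09:15", "2019-12-02 17:40", "2019-12-03 10:05",
    ]),
    "keystroke_counter": [12, 30, 7],
})

# Group the rows into calendar days.
daily = df.groupby(pd.Grouper(key="date_time", freq="D"))

# Keystroke trend: sum of keystroke_counter per day.
keystrokes_per_day = daily["keystroke_counter"].sum()

# Daily input volume: number of records per day.
records_per_day = daily.size()

# Start and end dates of the input, as shown in the legend.
first_input, last_input = df["date_time"].min(), df["date_time"].max()

print(keystrokes_per_day)
print(records_per_day)
```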

Task 4 & 5

The inputs of User 0 are considered for this task. Note that the hour of the day has been derived from the date_time feature and added as the column hour_of_day.

User activity: to define user activity, we consider the sum of the following features:

Activity of user

We would like to learn how active a user is over the course of the day. Therefore, we calculate user_activity for each hour_of_day.
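A minimal sketch of this computation is given below. The set of features that are summed is not listed here, so `ACTIVITY_FEATURES` and the `mouse_clicks` column are hypothetical placeholders; only date_time, hour_of_day, user_activity, and keystroke_counter are names taken from the text.

```python
import pandas as pd

# Made-up sample rows standing in for User 0's data.
df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2019-12-02 09:15", "2019-12-02 09:50",
        "2019-12-02 14:05", "2019-12-03 14:30",
    ]),
    "keystroke_counter": [12, 8, 20, 10],
    "mouse_clicks": [3, 2, 5, 0],  # hypothetical feature name
})

# Hypothetical choice of features that make up "activity".
ACTIVITY_FEATURES = ["keystroke_counter", "mouse_clicks"]

# Derive the hour of the day from the datetime column.
df["hour_of_day"] = df["date_time"].dt.hour

# Activity of a row is the sum of the selected features.
df["user_activity"] = df[ACTIVITY_FEATURES].sum(axis=1)

# Total activity per hour, with hours that have no rows filled with 0.
activity_per_hour = (
    df.groupby("hour_of_day")["user_activity"]
    .sum()
    .reindex(range(24), fill_value=0)
)
print(activity_per_hour)
```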

Activeness of User 0 over the day

In this figure, the activeness of User 0 over the day is plotted.

From this histogram, the user's sleep period can be distinguished.

We observe that there is little or no activity from 00:00 until 08:00, so this might be User 0's sleep time.

The user is also active after waking up, and highly active around midday, towards the end of the day, and at night until midnight.

Classification of activeness of User 0

Now we are interested in dividing this activeness into states: fully-active (a considerable number of interactions via mouse and keyboard), middle, and passive (very few interactions). Because our data is unlabeled, K-means clustering is used to cluster the data.

Step 1: Prepare the dataset

Step 2: Data pre-process/scaling

Step 3: Process the data in K-Means method

Step 4: Prepare the categories based on the centroids

Step 5: Cluster the dataset using centroids and show output

We have thus clustered the data into three categories, as plotted in the figure above. High activity, middle activity, and low activity are shown in green, red, and blue, respectively.
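The five steps can be sketched with scikit-learn as follows. This is an illustration under assumptions: the hourly activity values are made up, `MinMaxScaler` is one possible choice for the scaling step, and naming the clusters by sorting their centroids is one way to map cluster ids to states.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Step 1: made-up activity totals for the 24 hours of the day.
hourly_activity = np.array(
    [0, 0, 0, 0, 0, 0, 0, 1, 30, 55, 60, 58,
     90, 95, 88, 50, 45, 52, 85, 92, 60, 48, 40, 35],
    dtype=float,
).reshape(-1, 1)

# Step 2: scale the single feature to the range [0, 1].
scaled = MinMaxScaler().fit_transform(hourly_activity)

# Step 3: fit K-means with three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Step 4: name the clusters by ascending centroid value.
order = np.argsort(kmeans.cluster_centers_.ravel())
names = ["passive", "middle", "fully-active"]
label_of_cluster = {int(c): name for c, name in zip(order, names)}

# Step 5: assign a state to every hour.
states = [label_of_cluster[int(c)] for c in kmeans.labels_]
for hour, state in enumerate(states):
    print(f"{hour:02d}:00 {state}")
```

Sorting the centroids is needed because K-means assigns cluster ids arbitrarily; without it, cluster 0 is not guaranteed to be the least active one.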

As the activity levels are now known, we can use regression to predict the future activeness of the user.

Task 6: Probability of switching among states

Here we calculate the probability of User 0 switching between states. We use the outcome and levels from the previous section for the calculation.

After grouping by hour_of_day, there are 24 states for User 0, one per hour, and therefore 23 transitions between consecutive hours.

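The following table shows the switching probabilities from the state in the first column to the state in the first row. Each entry is the number of observed transitions divided by the total number of transitions leaving the row's state.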

| from \ to | fully-active | middle | passive |
|---|---|---|---|
| fully-active | 2 ÷ 6 = 0.33 | 1 ÷ 6 = 0.17 | 3 ÷ 6 = 0.50 |
| middle | 1 ÷ 5 = 0.20 | 2 ÷ 5 = 0.40 | 2 ÷ 5 = 0.40 |
| passive | 3 ÷ 12 = 0.25 | 2 ÷ 12 = 0.17 | 7 ÷ 12 = 0.58 |
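The counting behind such a table can be sketched in plain Python. The function name `transition_probabilities` and the short example sequence are assumptions for illustration; the sequence is not User 0's actual hourly state sequence.

```python
from collections import Counter

STATES = ["fully-active", "middle", "passive"]

def transition_probabilities(sequence):
    """Estimate P(next state | current state) from consecutive pairs."""
    # Count every (current, next) pair of consecutive states.
    counts = Counter(zip(sequence, sequence[1:]))
    matrix = {}
    for src in STATES:
        # Total number of transitions leaving `src`.
        total = sum(counts[(src, dst)] for dst in STATES)
        matrix[src] = {
            dst: (counts[(src, dst)] / total if total else 0.0)
            for dst in STATES
        }
    return matrix

# Made-up sequence of six hourly states (five transitions).
example = ["passive", "passive", "middle",
           "fully-active", "fully-active", "passive"]
probs = transition_probabilities(example)
for src in STATES:
    print(src, probs[src])
```

Each row is normalized by the number of transitions leaving that state, so every row with at least one outgoing transition sums to 1.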

Task 7: Insights into user behavior

Most active/used app

One interesting fact is which application each user uses most. Here we group the dataset by current_app and sum current_app_foreground_time; the app with the maximum foreground time is then stored for each user. Finally, this information is plotted in the first figure. From the second figure, we learn the most used app overall.
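A minimal sketch of this grouping is shown below. The column names current_app and current_app_foreground_time and the list df_users come from the text; the two small dataframes and their app names are made up.

```python
import pandas as pd

# Made-up stand-ins for two entries of df_users.
df_users = [
    pd.DataFrame({
        "current_app": ["chrome", "word", "chrome"],
        "current_app_foreground_time": [120.0, 300.0, 250.0],
    }),
    pd.DataFrame({
        "current_app": ["excel", "chrome", "excel"],
        "current_app_foreground_time": [390.0, 100.0, 30.0],
    }),
]

# Most used app per user: total foreground time per app, then the maximum.
top_app_per_user = []
for user_id, df in enumerate(df_users):
    time_per_app = df.groupby("current_app")["current_app_foreground_time"].sum()
    top_app_per_user.append((user_id, time_per_app.idxmax(), time_per_app.max()))

# Most used app overall: the same aggregation over all users combined.
overall = (
    pd.concat(df_users)
    .groupby("current_app")["current_app_foreground_time"]
    .sum()
)
most_used_overall = overall.idxmax()

print(top_app_per_user)
print(most_used_overall)
```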

Activity of all users per date

In the following figure, we can observe the activity of all users per date.

Notice that users are barely active or entirely inactive on weekends and Spanish public holidays. In other words, fewer users are active on weekends and holidays.

For example, there is no activity on 06.12.2019 (Constitution Day) or on 25.12.2019 (Christmas Day).
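One way to check such a pattern is to count the distinct active users per date and flag weekends, as sketched below. The `user` column name and the event rows are hypothetical; public holidays would need a separate list of dates, which is not included here.

```python
import pandas as pd

# Made-up events: one row per recorded input, tagged with a user id.
events = pd.DataFrame({
    "user": [0, 1, 2, 0, 0, 1],  # hypothetical column name
    "date_time": pd.to_datetime([
        "2019-12-05 10:00", "2019-12-05 11:00", "2019-12-05 12:00",
        "2019-12-07 15:00",
        "2019-12-09 09:00", "2019-12-09 09:30",
    ]),
})

# Number of distinct users with at least one event on each date.
active_users = events.groupby(events["date_time"].dt.normalize())["user"].nunique()

# Insert the dates without any events so that they show up as 0.
full_range = pd.date_range(
    active_users.index.min(), active_users.index.max(), freq="D"
)
active_users = active_users.reindex(full_range, fill_value=0)

# dayofweek: Monday is 0, so Saturday and Sunday are 5 and 6.
summary = pd.DataFrame({
    "active_users": active_users,
    "is_weekend": active_users.index.dayofweek >= 5,
})
print(summary)
```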

Another interesting finding is a typo in the paper describing the dataset: the feature name click_speed_aveage_N on page 6, in the first row of Table 4, should be click_speed_average_N.

Remarks